Model-free trajectory optimization for reinforcement learning
Many of the recent trajectory optimization algorithms alternate between local approximation of the dynamics and conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-Function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related trajectory optimization algorithms that linearize the dynamics.
Local Bayesian optimization of motor skills
Bayesian optimization is renowned for its sample efficiency, but its application to higher dimensional tasks is impeded by its focus on global optimization. To scale to higher dimensional problems, we leverage the sample efficiency of Bayesian optimization in a local context. The optimization of the acquisition function is restricted to the vicinity of a Gaussian search distribution which is moved towards high value areas of the objective. The proposed information-theoretic update of the search distribution results in a Bayesian interpretation of local stochastic search: the search distribution encodes prior knowledge on the optimum's location and is weighted at each iteration by the likelihood of this location's optimality. We demonstrate the effectiveness of our algorithm on several benchmark objective functions as well as a continuous robotic task in which an informative prior is obtained by imitation learning.
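A minimal numeric sketch of the search-distribution update described above. This is a stand-in using plain likelihood-weighted stochastic search; the paper's method additionally fits a probabilistic surrogate and optimizes an acquisition function locally, so the function name and hyperparameters here are hypothetical:

```python
import numpy as np

def local_stochastic_search(objective, mu, sigma, iters=60, samples=64, seed=0):
    """Toy local search: the Gaussian search distribution encodes prior
    knowledge on the optimum's location and is reweighted each iteration
    by a soft likelihood of each sample being optimal."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        x = rng.normal(mu, sigma, size=(samples, mu.size))  # candidates near the search distribution
        f = np.array([objective(xi) for xi in x])
        w = np.exp((f - f.max()) / (f.std() + 1e-9))        # soft likelihood of optimality
        w /= w.sum()
        mu = w @ x                                          # move toward high-value areas
        sigma = np.maximum(np.sqrt(w @ (x - mu) ** 2), 1e-3)
    return mu

# maximize -||x - 2||^2 in 2-D; the optimum is at (2, 2)
opt = local_stochastic_search(lambda x: -np.sum((x - 2.0) ** 2),
                              mu=np.zeros(2), sigma=np.ones(2))
```

The key property illustrated is that all candidates are drawn in the vicinity of the current search distribution, which then shifts and shrinks toward high-value regions.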
Empowered skills
Robot Reinforcement Learning (RL) algorithms return a policy that maximizes a global cumulative reward signal but typically do not create diverse behaviors. Hence, the policy will typically only capture a single solution of a task. However, many motor tasks have a large variety of solutions, and knowledge about these solutions can have several advantages. For example, in an adversarial setting such as robot table tennis, the lack of diversity renders the behavior predictable and hence easy to counter for the opponent. In an interactive setting such as learning from human feedback, an emphasis on diversity gives the human more opportunity for guiding the robot and prevents the latter from getting stuck in local optima of the task. In order to increase the diversity of the learned behaviors, we leverage prior work on intrinsic motivation and empowerment. We derive a new intrinsic motivation signal by enriching the description of a task with an outcome space, representing interesting aspects of a sensorimotor stream. For example, in table tennis, the outcome space could be given by the return position and return ball speed. The intrinsic motivation is then given by the diversity of future outcomes, a concept also known as empowerment. We derive a new policy search algorithm that maximizes a trade-off between the extrinsic reward and this intrinsic motivation criterion. Experiments on a planar reaching task and simulated robot table tennis demonstrate that our algorithm can learn a diverse set of behaviors within the area of interest of the tasks.
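The trade-off between extrinsic reward and outcome diversity can be sketched as follows. This is illustrative only: the histogram-entropy estimator, the function names, and the weight `beta` are assumptions for the sketch, not the paper's exact empowerment estimator:

```python
import numpy as np

def outcome_entropy(outcomes, bins=5, ranges=None):
    """Diversity of future outcomes, estimated as the entropy of a histogram
    over a low-dimensional outcome space (e.g. return position and speed)."""
    hist, _ = np.histogramdd(outcomes, bins=bins, range=ranges)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def empowered_score(extrinsic_reward, outcomes, beta=0.5):
    # trade-off between the task reward and the diversity bonus
    return extrinsic_reward + beta * outcome_entropy(outcomes, ranges=[(0, 1), (0, 1)])

rng = np.random.default_rng(0)
diverse = rng.uniform(0.0, 1.0, size=(200, 2))        # outcomes spread over the space
narrow = 0.5 + 0.01 * rng.normal(size=(200, 2))       # every rollout ends nearly the same way
s_diverse = empowered_score(1.0, diverse)
s_narrow = empowered_score(1.0, narrow)
```

With equal extrinsic reward, the behavior producing spread-out outcomes receives the higher score, which is the effect the intrinsic signal is meant to create.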
Layered direct policy search for learning hierarchical skills
Solutions to real-world robotic tasks often require complex behaviors in high dimensional continuous state and action spaces. Reinforcement Learning (RL) is aimed at learning such behaviors but often fails for lack of scalability. To address this issue, Hierarchical RL (HRL) algorithms leverage hierarchical policies to exploit the structure of a task. However, many HRL algorithms rely on task-specific knowledge such as a set of predefined sub-policies or sub-goals. In this paper we propose a new HRL algorithm based on information-theoretic principles to autonomously uncover a diverse set of sub-policies and their activation policies. Moreover, the learning process mirrors the policy's structure and is thus also hierarchical, consisting of a set of independent optimization problems. The hierarchical structure of the learning process allows us to control the learning rate of the sub-policies and the gating individually, and to add specific information-theoretic constraints to each layer to ensure the diversification of the sub-policies. We evaluate our algorithm on two high dimensional continuous tasks and experimentally demonstrate its ability to autonomously discover a rich set of sub-policies.
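A toy sketch of the two-layer policy structure (a gating distribution over sub-policies, plus the sub-policies themselves). The class and its linear-Gaussian form are hypothetical and only illustrate the hierarchy, not the paper's learning algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class HierarchicalPolicy:
    """Two layers: a gating (activation) policy picks a sub-policy index o
    given the state, then the chosen linear-Gaussian sub-policy emits the
    action. Each layer has its own parameters, so they could in principle
    be optimized independently."""
    def __init__(self, n_sub, state_dim, action_dim):
        self.W_gate = 0.1 * rng.normal(size=(n_sub, state_dim))
        self.W_sub = 0.1 * rng.normal(size=(n_sub, action_dim, state_dim))
        self.sigma = 0.1
    def act(self, s):
        o = rng.choice(len(self.W_gate), p=softmax(self.W_gate @ s))  # activation policy
        mean = self.W_sub[o] @ s                                      # chosen sub-policy
        return o, mean + self.sigma * rng.normal(size=mean.shape)

pol = HierarchicalPolicy(n_sub=3, state_dim=4, action_dim=2)
o, a = pol.act(np.ones(4))
```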
Model-Free Trajectory-based Policy Optimization with Monotonic Improvement
Many of the recent trajectory optimization algorithms alternate between linear approximation of the system dynamics around the mean trajectory and conservative policy update. One way of constraining the policy change is by bounding the Kullback-Leibler (KL) divergence between successive policies. These approaches have already demonstrated great experimental success in challenging problems such as end-to-end control of physical systems. However, these approaches lack any improvement guarantee, as the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. The algorithm backpropagates a local, quadratic and time-dependent Q-Function learned from trajectory data instead of a model of the system dynamics. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics. We experimentally demonstrate on highly non-linear control tasks the improvement in performance of our algorithm in comparison to approaches linearizing the system dynamics. To show the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme to derive a lower bound on the change in policy return between successive iterations.
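The closed-form, exactly KL-constrained update can be illustrated for a one-dimensional Gaussian policy and a local quadratic Q-model. This is a sketch under simplifying assumptions, not the paper's implementation; bisection over the temperature `eta` is one standard way to satisfy the KL bound exactly:

```python
import numpy as np

def kl_gauss(m1, s1, m0, s0):
    # KL( N(m1, s1^2) || N(m0, s0^2) )
    return np.log(s0 / s1) + (s1**2 + (m1 - m0)**2) / (2 * s0**2) - 0.5

def kl_constrained_update(mu0, s0, g, h, eps, iters=100):
    """New policy proportional to old(a) * exp(Q(a)/eta) for the quadratic
    model Q(a) = g*a - 0.5*h*a**2; a Gaussian times an exp-quadratic stays
    Gaussian, so the update is closed form. eta is found by bisection so
    that KL(new || old) hits the bound eps exactly."""
    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        eta = np.sqrt(lo * hi)            # bisect in log space
        prec = 1.0 / s0**2 + h / eta      # new precision
        mu = (mu0 / s0**2 + g / eta) / prec
        s = 1.0 / np.sqrt(prec)
        if kl_gauss(mu, s, mu0, s0) > eps:
            lo = eta                      # step too greedy: more regularization
        else:
            hi = eta
    return mu, s

mu, s = kl_constrained_update(mu0=0.0, s0=1.0, g=1.0, h=1.0, eps=0.1)
```

Since KL is monotone in `eta`, the bisection converges to the unique temperature at which the constraint is active, which is what "exact KL-constraint satisfaction" refers to.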
Projections for Approximate Policy Iteration Algorithms
Approximate policy iteration is a class of reinforcement learning (RL) algorithms where the policy is encoded using a function approximator, and which has been especially prominent in RL with continuous action spaces. In this class of RL algorithms, ensuring an increase of the policy return during the policy update often requires constraining the change in action distribution. Several approximations exist in the literature to solve this constrained policy update problem. In this paper, we propose to improve over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one, which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
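For a fixed-variance Gaussian policy, one such projection can be sketched directly: map an unconstrained parameter to a mean whose KL to the old policy is guaranteed to satisfy the bound, so plain gradient descent can then run on the unconstrained parameter. This is an illustrative special case; the paper's projections are more general:

```python
import numpy as np

def project_mean(v, mu0, sigma, eps):
    """Projection layer for a fixed-variance Gaussian policy: for that case
    KL(N(mu, s) || N(mu0, s)) = ||mu - mu0||^2 / (2 s^2), so rescaling the
    step onto the KL ball makes any unconstrained v feasible."""
    step = v - mu0
    max_norm = np.sqrt(2.0 * eps) * sigma
    norm = np.linalg.norm(step)
    if norm > max_norm:
        step = step * (max_norm / norm)
    return mu0 + step

def kl(mu, mu0, sigma):
    return np.sum((mu - mu0) ** 2) / (2.0 * sigma ** 2)

mu0, sigma, eps = np.zeros(3), 0.5, 0.05
rng = np.random.default_rng(1)
# arbitrary unconstrained parameters all map to feasible means
mus = [project_mean(rng.normal(size=3), mu0, sigma, eps) for _ in range(100)]
```

Because the constraint is baked into the parameterization, the outer optimizer never has to solve the constrained problem itself.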
Learning Replanning Policies with Direct Policy Search
Direct policy search has been successful in learning challenging real-world robotic motor skills by learning open-loop movement primitives with high sample efficiency. These primitives can be generalized to different contexts with varying initial configurations and goals. Current state-of-the-art contextual policy search algorithms, however, cannot adapt to changing, noisy context measurements. Yet these are common characteristics of real-world robotic tasks. Planning a trajectory ahead based on an inaccurate context that may change during the motion often results in poor accuracy, especially with highly dynamical tasks. To adapt to updated contexts, it is sensible to learn trajectory replanning strategies. We propose a framework to learn trajectory replanning policies via contextual policy search and demonstrate that they are safe for the robot, that they can be learned efficiently and that they outperform non-replanning policies for problems with partially observable or perturbed contexts.
Sample and feedback efficient hierarchical reinforcement learning from human preferences
While reinforcement learning has led to promising results in robotics, defining an informative reward function can sometimes prove to be challenging. Prior work considered including the human in the loop to jointly learn the reward function and the optimal policy. Generating samples from a physical robot and requesting human feedback are both taxing efforts for which efficiency is critical. In contrast to prior work, in this paper we propose to learn reward functions from both the robot and the human perspectives in order to improve on both efficiency metrics. On one side, learning a reward function from the human perspective increases feedback efficiency by assuming that humans rank trajectories according to an outcome space of reduced dimensionality. On the other side, learning a reward function from the robot perspective circumvents the need for learning a dynamics model while retaining the sample efficiency of model-based approaches. We provide an algorithm that incorporates bi-perspective reward learning into a general hierarchical reinforcement learning framework and demonstrate the merits of our approach on a toy task and a simulated robot grasping task.
Compatible natural gradient policy search
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
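The entropy-bounding idea can be illustrated for a one-dimensional Gaussian policy. This is a simplified sketch, not COPOS itself: it only clips the standard deviation so that each update loses at most `delta` nats of entropy:

```python
import numpy as np

def entropy(sigma):
    # differential entropy of N(mu, sigma^2)
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

def bounded_entropy_step(sigma_old, sigma_proposed, delta):
    """Enforce H(old) - H(new) <= delta. Since H depends on sigma only
    through log(sigma), the bound is equivalent to
    sigma_new >= sigma_old * exp(-delta)."""
    return max(sigma_proposed, sigma_old * np.exp(-delta))

# an overly aggressive variance shrink gets clipped to the entropy budget
s_new = bounded_entropy_step(1.0, 0.1, 0.05)
```

A schedule of per-update entropy budgets like this prevents the variance, and hence exploration, from collapsing prematurely.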